Abstract

Identifying population structure forms an important basis for genetic and evolutionary studies. Most current methods to 
identify population structure have limitations in analyzing haplotypes and recombination across the genome. Recently, a 
method of chromosome painting in silico has been developed to overcome these shortcomings and has been applied to 
multiple human genome sequences. This method detects the genome-wide transfer of DNA sequence chunks through 
homologous recombination. Here, we apply it to the frequently recombining bacterial species Helicobacter pylori, which has 
infected Homo sapiens since its origin in Africa and shows wide phylogeographic divergence. Multiple complete genome 
sequences were analyzed including sequences from Okinawa, Japan, that we recently sequenced. The newer method 
revealed a finer population structure than revealed by a previous method that examines only MLST housekeeping genes 
or a phylogenetic network analysis of the core genome. Novel subgroups were found in Europe, Amerind, and East Asia 
groups. Examination of genetic flux showed some singleton strains to be hybrids of subgroups and revealed evident signs 
of population admixture in Africa, Europe, and parts of Asia. We expect this approach to further our understanding of 
intraspecific bacterial evolution by revealing population structure at a finer scale. 
Introduction

Elucidation of population structure is an important basis for 
evolutionary studies in any species (Patterson et al. 2006; 
Robinson et al. 2010; Novembre and Ramachandran 2011) 
including studies such as the inference of demographic his- 
tory and the examination of population differentiation and 
selection (Akey et al. 2004; Holsinger and Weir 2009; Henn 
et al. 2012). Identifying population structures and assigning 
individuals to the identified population categories (groups) 
are central issues in analyzing any genetic data set. They are, 
for example, an important basis for association studies (Devlin 
and Roeder 1999; Nakamura et al. 2005). Two of the most 
popular approaches for identifying population structures 
using genetic data are principal component analysis (PCA) 
(Menozzi et al. 1978) and STRUCTURE (Pritchard et al. 2000; 
Falush, Stephens et al. 2003). In addition to these, phyloge- 
netic tree construction and BAPS (Corander et al. 2008) have 
also been used to describe population structure. 

The increased availability of genome-wide sequence data 
poses challenges for current methods to identify population 
structure accurately and reliably in practical computational 
time. PCA can quickly process large data sets with hundreds 
of thousands of SNPs and thousands of samples (Patterson 
et al. 2006), and phylogenetic trees of core genomes can 
classify prokaryotic species (Snel et al. 2005; Zhi et al. 
2012) and populations within a bacterial species (Kawai 
et al. 2011; Okoro et al. 2012). However, PCA and phyloge- 
netic tree construction are not designed to infer the number 
of populations directly. Furthermore, in PCA and phyloge- 
netic tree construction, correlation (linkage) between SNPs 
and their relative position is not taken into account. BAPS 
has a linkage model option (Corander and Tang 2007). 
STRUCTURE also has a linkage model option that accounts 
for recombination events (admixture) among populations 
and correlations among SNPs that arise in admixed populations 
(Falush, Stephens et al. 2003). However, STRUCTURE requires 
the number of populations (K) to be specified in 
advance, and typically K < 10 is needed for satisfactory convergence 
(Lawson et al. 2012). 

As a solution to these problems, a new tool called 
fineSTRUCTURE was recently developed (Lawson et al. 
2012). It is based on "chromosome painting" in silico, which 
infers recombination-derived "chunks" and reconstructs hap- 
lotypes on the chromosome of a "recipient" individual as a 
series of chunks from all other "donor" individuals in the 
sample (Lawson et al. 2012). The results are summarized 
into a "co-ancestry matrix," which contains the number of 
recombination events from each donor to each recipient in- 
dividual. Using this matrix, fineSTRUCTURE conducts model- 
based clustering of hundreds or thousands of individuals. So 
far, fineSTRUCTURE has only been applied to humans; nev- 
ertheless, this application has already revealed subtle popula- 
tion structures in worldwide populations (Lawson et al. 2012). 

In this study, we have applied this approach for the first 
time to a bacterial species. Knowledge of population structure 
has immediate applications to understanding bacteria's role 
in medicine, agriculture, and the environment (Achtman 
2008; Downing et al. 2011; Argudin et al. 2013). Moreover, 
this knowledge has affected and will continue to directly affect bacterial 
molecular epidemiology and public health management 
among other fields (Robinson et al. 2010). 

The population structures of bacterial species, however, 
are complex and often controversial due to uncertainty 
about the frequency and influence of recombination between 
lineages (Feil and Spratt 2001). Population structures of clonal 
bacteria with low rates of recombination can be inferred by 
phylogenetic trees (Hershberg et al. 2008; Morelli, Song, et al. 
2010; Holt et al. 2012; Okoro et al. 2012), which has been by far 
the most common approach in bacteria (Snel et al. 2005; Zhi 
et al. 2012). However, the extent of recombination varies 
among bacterial species, some of which show much higher 
rates of recombination than others (Perez-Losada et al. 2006; 
Didelot and Maiden 2010). Bacterial recombination is mainly 
intraspecific but occasionally occurs between species (Hanage 
et al. 2009; Corander et al. 2011). The question of how to 
elucidate the population structures of such highly recombin- 
ing bacterial species using genome-wide sequence data has 
been largely unexplored. 

To address this question, we have focused on Helicobacter 
pylori, a stomach pathogen that infects over half of all 
humans and causes gastritis, ulcers, and cancer (Suerbaum 
and Josenhans 2007). Helicobacter pylori is commonly ac- 
quired in childhood and can persist in the stomach over 
the host's entire lifespan (Suerbaum and Josenhans 2007). 
Helicobacter pylori has been with Homo sapiens since its 
origin in Africa and shows geographic patterns of genetic 
diversity that parallel human diversity (Linz et al. 2007). Of 
the known bacteria, H. pylori is likely the most frequently 
recombining species (Doolittle and Zhaxybayeva 2009) with 
a greater effect of homologous recombination than mutation 
(Morelli, Didelot, et al. 2010; Kennemann et al. 2011) as well 
as signatures of homologous recombination throughout 
the genome (Yahara et al. 2012). It is worth exploring how 
chromosome painting and fineSTRUCTURE could elucidate 
the population structure of genomes subject to such frequent 
recombination events. 

Previously, by STRUCTURE analysis using 7 housekeeping 
(MLST) genes (atpA, efp, mutY, ppa, trpC, ureI, and yphC), 
phylogeographic population structures of H. pylori were identified: 
hpEurope, hpSahul, hpEastAsia, hpAsia2, hpNEAfrica, 
hpAfrica1, and hpAfrica2 (Falush, Wirth et al. 2003; Linz et al. 
2007; Moodley et al. 2009, 2012). The hpEastAsia population is 
known to have three subpopulations, hspAmerind, hspEAsia, 
and hspMaori (Moodley et al. 2009). Of these subpopulations, 
the complete genome sequences of hspEAsia strains were 
recently obtained (Kawai et al. 2011). In addition to these, 
this study included two newly sequenced strains from 
Okinawa, Japan. Okinawans and mainland Japanese are 
known to be genetically differentiated (Hammer and Horai 
1995; Yamaguchi-Kabata et al. 2008). If the H. pylori strains 
from Okinawa are also genetically differentiated, these strains 
may form a regional subgroup useful for deepening our un- 
derstanding of H. pylori population structure in East Asia. 

Using chromosome painting and fineSTRUCTURE, we an- 
alyzed the complete genome sequences of H. pylori strains 
from various parts of the world. Our analysis, based on eluci- 
dation of genetic flux through homologous recombination 
between subgroups, has revealed the population structure of 
H. pylori at a finer scale than achieved by previous methods. 

Results 

Chromosome Painting In Silico of H. pylori 
Complete Genomes 

To elucidate the population structure of H. pylori, we applied 
the "chromosome painting" algorithm accounting for linkage 
information to complete genome sequences of the two afore- 
mentioned Okinawa strains and 27 public H. pylori strains. 
Each genome (CDSs of all one-to-one orthologous genes in 
the core genomic regions, to be precise) was reconstructed 
using chunks of DNA donated by other individual genomes. 
The result is visualized in figure 1. Chunk donors are colored 
according to donor subgroups (the suffix "_sg" and a consecutive 
number [e.g., "_sg1"] are used to name each subgroup), 
which we will explain in the next section. The 
median length of a chunk is 14 bp (interquartile range: 5-39 bp). 
The distribution of chunk sizes is shown in supplementary 
figure S1, Supplementary Material online. 
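
As a minimal sketch of how chunk lengths like those summarized above could be tabulated, the following Python snippet treats one painted lane as a list of SNP positions with the donor assigned at each SNP. This input format, the function name chunk_lengths, and the toy values are illustrative assumptions, not the output format of the painting software. A chunk is taken here to be a run of consecutive SNPs copied from the same donor, so two adjacent chunks from one donor would be merged by this simplification.

```python
from itertools import groupby
from statistics import median, quantiles


def chunk_lengths(positions, donors):
    """Return the length in bp of each run of consecutive SNPs copied from one donor.

    positions: sorted SNP coordinates (bp) along the recipient chromosome.
    donors: donor label assigned at each SNP (hypothetical hard assignment).
    """
    if len(positions) != len(donors):
        raise ValueError("positions and donors must have the same length")
    lengths = []
    idx = 0
    for _, run in groupby(donors):
        n = len(list(run))
        start, end = positions[idx], positions[idx + n - 1]
        lengths.append(end - start + 1)  # inclusive span of the run
        idx += n
    return lengths


# Toy lane (hypothetical data, not taken from the study).
positions = [10, 14, 20, 31, 35, 36, 50, 58]
donors = ["A", "A", "B", "B", "B", "A", "C", "C"]

lengths = chunk_lengths(positions, donors)
q1, _, q3 = quantiles(lengths, n=4)  # first and third quartiles
print("lengths:", lengths)
print("median:", median(lengths), "IQR:", (q1, q3))
```

The quartile convention used by statistics.quantiles may differ from the one used for the reported interquartile range, so the sketch illustrates the summary only, not the exact figures.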

Based on the inference of recombination-derived 
chunks and their donors across the genomes, the chromo- 
some painting algorithm calculates the expected number of 
chunks imported from a donor to a recipient genome 
and then summarizes these values into a matrix ("co-ancestry 
matrix"). The matrix for this study is visualized as a heat 
map (fig. 2a). 
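
The summary step described above can be sketched as follows. This is a simplified illustration under stated assumptions: the expected number of chunks is approximated by averaging chunk counts over a few sampled paintings per recipient, whereas the painting algorithm computes the expectation under its own model. The names count_chunks, coancestry_matrix, and the toy strains s1 to s3 are hypothetical.

```python
from collections import Counter
from itertools import groupby


def count_chunks(lane):
    """Count chunks per donor in one painted lane (a chunk is a run of one donor)."""
    return Counter(donor for donor, _ in groupby(lane))


def coancestry_matrix(paintings, strains):
    """Average chunk counts over sampled paintings.

    paintings[r] is a list of sampled lanes for recipient r, each lane being
    a list of donor labels along the chromosome.
    Returns matrix[r][d]: mean number of chunks recipient r copied from donor d.
    """
    matrix = {r: {d: 0.0 for d in strains} for r in strains}
    for r in strains:
        samples = paintings[r]
        for lane in samples:
            for donor, n in count_chunks(lane).items():
                matrix[r][donor] += n / len(samples)
    return matrix


# Toy input (hypothetical): two sampled paintings per recipient.
strains = ["s1", "s2", "s3"]
paintings = {
    "s1": [["s2", "s2", "s3", "s2"], ["s2", "s3", "s3", "s2"]],
    "s2": [["s1", "s1", "s1", "s3"], ["s1", "s3", "s1", "s3"]],
    "s3": [["s1", "s2", "s2", "s2"], ["s2", "s2", "s2", "s2"]],
}

m = coancestry_matrix(paintings, strains)
for r in strains:
    print(r, [m[r][d] for d in strains])  # rows: recipients, columns: donors
```

Rows are recipients and columns are donors, matching the orientation of the heat map in figure 2.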

Population Structure at a Finer Scale 
Based on the co-ancestry matrix visualized as a heat map, 
individual strains were assigned to subgroups by the 
fineSTRUCTURE clustering algorithm (fig. 2a). Hereafter, we 
use "subgroup" to designate a cluster, and each subgroup is 
named by adding the suffix "_sg" and a consecutive 
number (e.g., "Europe_sg1"). Unlike STRUCTURE, this algorithm 
can infer the number of clusters (K) and partition the 
strains into K subgroups with indistinguishable genetic ances- 
try. Based on likelihood of the co-ancestry matrix, the infer- 
ence is performed by a Bayesian MCMC (Markov chain 
Monte Carlo) approach that explores the space of possible 
partitions by using an algorithm for proposing new partitions 
(Lawson et al. 2012). 
Fig. 1. Chromosome painting in silico. Each lane indicates the chromosome of a strain shown on the right. The strains are classified by fineSTRUCTURE 
into subgroups labeled by colors (table 1 and fig. 2) on the left. A color along the chromosome indicates the subgroup that donated a chunk of SNPs 
through homologous recombination. All genomic positions are transformed to those of a reference strain (26695). 



The results of the model-based clustering are represented 
as a tree to the left of the co-ancestry matrix (fig. 2). 
The names of individual strains are replaced with those 
of the identified subgroups (table 1). We compared the 
results with those from the traditional STRUCTURE algorithm 
that is limited to seven housekeeping genes (also shown in 
table 1). 

The results for Africa1 and Asia2 were the same for both 
methods. The European strains identified by STRUCTURE 
(hpEurope) were divided by fineSTRUCTURE into two sub- 
groups and two singletons (SJM180, PeCan4). We will exam- 
ine the two singleton strains later (in the "Hybrid genomes" 
section). The hpEastAsia strains (hspEAsia and hspAmerind 
strains) showed the clearest difference: fineSTRUCTURE 
divided hspAmerind into four subgroups including two sin- 
gletons (we named them Amerind_sg1, Amerind_sg2, 
Amerind_sg3, and Amerind_sg4) and hspEAsia into four 
subgroups including one singleton (we named them 
EastAsia_sg1, EastAsia_sg2, EastAsia_sg3, and EastAsia_sg4). 

Comparison with Phylogenetic and No-Linkage 
Methods 

From the same genomic data, we also constructed a 
phylogenetic network (fig. 3). Within each clade, the strains 
are colored according to the population assignment by 
fineSTRUCTURE. The result appeared globally consistent 
with the fineSTRUCTURE population assignments. The 
clades of East Asia, Amerind, Asia2, Europe, and Africa1 are 
seemingly polytomous, suggesting rampant recombination. 

We also constructed a traditional phylogenetic tree 
(supplementary fig. S2, Supplementary Material online). 
The tree topology was also nearly consistent with the 
fineSTRUCTURE population assignments. However, the 
bifurcating branches in East Asia are short and have 
lower bootstrap values, suggesting that the inference is 
not robust. This likely reflects a difference between the 
two methods with respect to whether they account for 
the occurrence of homologous recombination. 

We also conducted the no-linkage approach of chromosome 
painting and fineSTRUCTURE, which treats markers 
as independent, as PCA does (Lawson et al. 2012); it 
ignores linkage and corresponds to assuming that the recombination 
rate between any pair of markers is infinite. 
The result of the population assignment (supplementary 
fig. S3, Supplementary Material online) was also almost 
the same as that by the linkage approach. 

Hybrid Genomes 

In the phylogenetic network and tree (fig. 3 and supplementary 
fig. S2, Supplementary Material online), there is a clear 






Chromosome Painting In Silico in Bacterial Species • doi:10.1093/molbev/mst055 



MBE 



[Figure 2, panels (a) and (b). Row labels in both panels, top to bottom: Africa1_sg1, SJM180, Europe_sg1, Europe_sg2, Asia2_sg1, PeCan4, Amerind_sg1, Amerind_sg2, Amerind_sg3, Amerind_sg4, EastAsia_sg1, EastAsia_sg2, EastAsia_sg3, EastAsia_sg4.]

Fig. 2. Co-ancestry matrix with population structure and genetic flux. The color of each cell of the matrix indicates the expected number of chunks 
imported from a donor genome (column) to a recipient genome (row). The name of each strain is indicated on the right. (a) Population assignments 
and genetic flux. The tree on the left shows clustering for assignment of the listed population subgroups. The two black-lined boxes indicate asymmetry 
in genetic flux between EastAsia/Amerind and the other subgroups. (b) Hybrid strains and admixed subgroups. Two singleton strains, SJM180 and 
PeCan4, are hybrids as indicated by the gray dashed boxes. Signs of population admixture in Africa1, Europe, and Asia2 are indicated by bold black 
dashed boxes, whereas those in EastAsia_sg3 and EastAsia_sg4 are indicated by thin black dashed boxes. 
Table 1. Comparison of Population Assignment. 

Strain: J99, Gambia94, SJM180, 26695, HPAG1, Lithuania75, P12, G27, B38, B8, India7, Santal49 (SNT49), PeCan4, Puno120, Puno135, Sat464, Shi470, Cuz20, v225d, 35A, F57, F30, F16, 83, OK310, 51, 52, F32, OK113 

fineSTRUCTURE (Linkage Model): singleton (hybrid); Europe_sg1; singleton (hybrid); Amerind_sg1; Amerind_sg2; singleton (Amerind_sg3); singleton (Amerind_sg4); EastAsia_sg1; EastAsia_sg2; EastAsia_sg3; singleton (East 

STRUCTURE: hpAfrica1; hpEurope; hpAsia2; hpEurope; hpEastAsia (hspAmerind); hpEastAsia (hspEAsia) 

Note. "sg" is abbreviated from "subgroup." 



separation between the Western (Europe and Africa) strains 
and the East Asian strains, a result that is also seen in the 
co-ancestry matrix (fig. 2a). 

An exception to this separation is PeCan4. In the 
phylogenetic network (fig. 3), the PeCan4 genome 
appears to be a hybrid between Amerind genomes and 
Western genomes. The co-ancestry matrix (fig. 2) indi- 
cates that PeCan4 is a hybrid strain in that it has 
received a considerable number of chunks from the 
Amerind, Africa1_sg1, and European strains. The co-ancestry 
matrix (fig. 2) also indicates that SJM180 is a 
hybrid between Africa1_sg1 and European subgroups. 
This interpretation is consistent with the phylogenetic 
network (fig. 3). Thus, comparison of the co-ancestry 
matrix with the phylogenetic network reveals novel 
genome characteristics of these two strains, demonstrat- 
ing the power of chromosome painting. 

Signatures of Admixture Events 

Fig. 3. Phylogenetic network. The colors indicate subgroups identified 
by fineSTRUCTURE (as in fig. 2 and table 1). Scale bar indicates substitutions 
per nucleotide site. 

The boxes in figure 2b indicate evident signs of admixture. 
Here, we use the term "admixture" when there is a flow of 
chunks (more generally, "genetic flux") between (sub)groups. 
The two European subgroups appear to have been 
admixed with Africa1 and Asia2 populations, which is consistent 
with previous reports using MLST data (Falush, Wirth 
et al. 2003; Moodley et al. 2012). Conversely, Africa1 and Asia2 
were evidently admixed with European populations, which is 
also consistent with a previous report (Linz et al. 2007). 
Interestingly, signs of admixture can also be seen in the 
EastAsia_sg3 and EastAsia_sg4 (OK113 from Okinawa) subgroups: 
29-31% of the chunks were estimated to be imported 
from European and Asia2 populations, which is 
significantly higher than the proportion imported into the other subgroups 
in East Asia and Amerind (P < 0.005, Wilcoxon's rank sum 
test). Although this seems to be a result of relatively recent 
admixture, there is currently no way to infer the date of ad- 
mixture based on the co-ancestry matrix. 
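
The comparison above relies on Wilcoxon's rank sum test. The following is a minimal pure-Python sketch of that test using the normal approximation without a tie correction. The proportions are hypothetical placeholders rather than the study's data, and the function name rank_sum_test is an assumption. For groups this small, an exact test would be preferable, so the P value printed here should not be compared with the one reported in the text.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no tie correction).

    Returns (W, z, p), where W is the sum of the ranks of x in the pooled sample.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a block of tied values and give them their mid-rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(x), len(y)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return w, z, p


# Hypothetical proportions of chunks imported from European and Asia2 donors.
admixed = [0.29, 0.30, 0.31]
others = [0.09, 0.10, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16]

w, z, p = rank_sum_test(admixed, others)
print(f"W = {w}, z = {z:.3f}, p = {p:.4f}")
```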

Asymmetry in Genetic Flux 

In general, export from country A to country B might be 
larger than import from country B to country A. The co- 
ancestry matrix revealed such asymmetries in the genetic 
flux between two populations. Figure 2a indicates that East Asia 
and Amerind populations imported a smaller number of 
chunks than they exported (P < 10^-15, Wilcoxon's rank 
sum test) to external populations (African, European, and 
Asia2 populations). We will discuss its interpretation 
later ("Evolution of Amerind/ East Asia H. pylori" in 
Discussion). 
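
To make the notion of asymmetric flux concrete, the sketch below sums the relevant blocks of a co-ancestry matrix stored with recipients as rows and donors as columns, as in figure 2. The matrix values, the strain labels e1, e2, w1, w2, and the helper flux_between are hypothetical. The sketch compares block totals only and does not reproduce the statistical test used in the study.

```python
def flux_between(matrix, recipients, donors):
    """Total expected chunks copied by `recipients` (rows) from `donors` (columns)."""
    return sum(matrix[r][d] for r in recipients for d in donors)


# Toy co-ancestry matrix (hypothetical values): matrix[recipient][donor].
matrix = {
    "e1": {"e1": 0, "e2": 90, "w1": 5, "w2": 5},
    "e2": {"e1": 88, "e2": 0, "w1": 6, "w2": 6},
    "w1": {"e1": 15, "e2": 15, "w1": 0, "w2": 70},
    "w2": {"e1": 14, "e2": 16, "w1": 70, "w2": 0},
}
east = ["e1", "e2"]      # stands in for East Asia and Amerind strains
external = ["w1", "w2"]  # stands in for external populations

imported = flux_between(matrix, east, external)  # external donors into east recipients
exported = flux_between(matrix, external, east)  # east donors into external recipients
print("imported:", imported, "exported:", exported)
```

In this toy matrix the eastern strains export more chunks than they import, which is the pattern marked by the black-lined boxes in figure 2a.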

Global View of Genetic Fluxes among Subgroups 
Fig. 4. Genetic fluxes between subgroups. Width of an arrow (in three grades) indicates the extent of flux. Arrows representing a small flux were 
omitted for clarity as explained in the text. Color of an arrow indicates the donor. 

For the subgroups identified by fineSTRUCTURE, we visualized 
the extents of genetic fluxes. Using the co-ancestry 
matrix, the proportion of chunks copied from a donor to a 
recipient subgroup on average was calculated (supplementary 
table S1, Supplementary Material online). The result is shown 
in figure 4. Arrows with the average proportion of chunks 
copied from a donor to a recipient subgroup less than 5.4% 
(first quartile of the values of all arrows) were omitted. 
Singletons were not included in the figures for simplicity. 
The figure illustrates that genetic flux into the admixed 
subgroups described earlier (Africa1_sg1, Europe_sg1, 
Europe_sg2, Asia2_sg1, and EastAsia_sg3) is relatively large. 
Genetic flux into other subgroups can also be seen, indicating 
a complex network of genetic flux among H. pylori 
populations. 
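
The aggregation and thresholding described above can be sketched as follows, assuming a co-ancestry matrix indexed as matrix[recipient][donor] and a mapping from subgroup names to member strains. The function names, the toy subgroups A, B, and C, and the decision to exclude copying within a subgroup are illustrative assumptions. The quartile convention of statistics.quantiles may also differ from the one behind the 5.4% cutoff.

```python
from statistics import quantiles


def subgroup_flux(matrix, groups):
    """Average proportion of chunks each recipient subgroup copied from each donor subgroup.

    Returns a dict keyed by (donor_group, recipient_group).
    """
    flux = {}
    for rec_group, recipients in groups.items():
        for don_group, donors in groups.items():
            if rec_group == don_group:
                continue  # assumption: within-subgroup copying is not drawn as an arrow
            proportions = []
            for r in recipients:
                total = sum(matrix[r].values())
                proportions.append(sum(matrix[r][d] for d in donors) / total)
            flux[(don_group, rec_group)] = sum(proportions) / len(proportions)
    return flux


def drop_small_arrows(flux):
    """Keep only arrows whose flux is at least the first quartile of all arrows."""
    q1 = quantiles(list(flux.values()), n=4)[0]
    return {arrow: value for arrow, value in flux.items() if value >= q1}


# Toy data (hypothetical values): matrix[recipient][donor].
matrix = {
    "a1": {"a1": 0, "a2": 60, "b1": 30, "c1": 10},
    "a2": {"a1": 50, "a2": 0, "b1": 30, "c1": 20},
    "b1": {"a1": 20, "a2": 20, "b1": 0, "c1": 60},
    "c1": {"a1": 5, "a2": 5, "b1": 90, "c1": 0},
}
groups = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}

flux = subgroup_flux(matrix, groups)
kept = drop_small_arrows(flux)
for (donor, recipient), value in sorted(kept.items()):
    print(f"{donor} -> {recipient}: {value:.2f}")
```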

Discussion 

Comparison of Methods for Population Structure 
Inference 

This study is the first to apply "chromosome painting" in 
silico and fineSTRUCTURE to organisms other than humans. 
We chose the bacterium H. pylori because 
humans and H. pylori are both highly recombining and 
share, to some extent, a phylogeographically differentiated 
population structure (Moodley and Linz 2009). 

A recent study in humans demonstrated that chromo- 
some painting followed by fineSTRUCTURE analysis is able 
to capture a more subtle, recent population structure com- 
pared with structures generated from STRUCTURE analysis or 
PCA (Lawson et al. 2012). The advantage of chromosome 
painting and fineSTRUCTURE became evident when ac- 
counting for linkage information, an advantage that is sup- 
ported in a recent review comparing various algorithms for 
population identification (Lawson and Falush 2012). In this 
study, the analysis of chromosome painting and 
fineSTRUCTURE indeed revealed a population structure 
with more subgroups and singletons than were revealed by 
previous methods. Meanwhile, the benefit of accounting for 
linkage information was not clear because almost the same 
population structure was inferred by the no-linkage approach, 
which ignores linkage information, as PCA does. This is 
probably because linkage is very weak in H. pylori owing 
to its high recombination rate. Under such a condition with a large 
number of markers without linkage, fineSTRUCTURE and 
PCA utilize similar information (Lawson et al. 2012). 
However, we think there are at least three other major ad- 
vantages to using the new method. 

First and most importantly, chromosome painting 
and fineSTRUCTURE can elucidate the extent and direction 
of genetic flux between subgroups as shown in figure 2. 
This is not possible with phylogenetic trees (supplementary 
fig. S2, Supplementary Material online), which do not 
take genetic flux between subgroups into account. The 
neighbor-net phylogenetic networks (fig. 3) can also visualize 
genetic flux between subgroups. However, their complex sig- 
natures are hard to interpret, and their output gives no 
information about the direction of genetic flux. Meanwhile, 
the new method is useful in visualizing the extent and 
direction of genetic flux and has disentangled the 
complex network interconnected by rampant recombination 
events. 

The second advantage is its computational efficiency. 
STRUCTURE can also handle hundreds of thousands 
of SNPs as seen in human population genetic studies 
(Jakobsson et al. 2008). However, chromosome painting 
combined with fineSTRUCTURE is much faster than 
STRUCTURE. Furthermore, the computation is easily paralle- 
lizable and can therefore be applied to hundreds of genome 
sequences. 
The third advantage is that chromosome painting is applicable 
even when sample size is small as in this study. In contrast, 
STRUCTURE has difficulty in population assignment for 
small sample sizes (Rosenberg et al. 2002; Yang et al. 2005). It 
requires at least 15-20 individuals per hypothesized population 
to achieve accurate clustering (Rosenberg et al. 2001) due 
to its assumptions of Hardy-Weinberg equilibrium and linkage 
equilibrium within each population. In this study, we 
applied STRUCTURE to the seven housekeeping (MLST) 
genes of more than 1,000 strains, but we were unable to 
apply STRUCTURE to the data of genome-wide SNPs of the 
29 strains. Similarly, we were unable to analyze the data by 
another popular program BAPS because it is also based on 
assumptions of Hardy-Weinberg equilibrium and linkage 
equilibrium within each population, which becomes prob- 
lematic with small sample sizes. 

Genetic Flux between Subgroups 
As mentioned earlier, an essential advantage of our approach 
is the ability to determine the extent and direction of genetic 
flux between identified population subgroups. Genetic flux 
among populations through homologous recombination has 
begun to be examined at the genomic scale. A recent study 
(Didelot et al. 2012) inferred the extent of genetic flux be- 
tween four phylogroups in Escherichia coli by using 
ClonalOrigin (Didelot et al. 2010). ClonalOrigin assumes a 
clonal genealogy with some additional edges representing 
recombination. The clonal genealogy can be inferred by 
ClonalFrame (Didelot and Falush 2007b), which has been 
used in another related study (Didelot et al. 2011) that quantified 
the flux between five lineages in Salmonella. According 
to the developers however, this method is not appropriate if 
the recombination rate is high enough to obscure clonal 
structure, as is the case with H. pylori (Didelot and Falush 
2007a). Therefore in this study, we used chromosome paint- 
ing and fineSTRUCTURE instead. 

A recent study developed another method to detect ho- 
mologous recombinant segments imported from external 
populations (Marttinen et al. 2012). The application of this 
method to 241 genomes of Streptococcus pneumoniae demonstrated 
its ability to handle hundreds of genome sequences. 
Unlike chromosome painting however, it does not model 
recombination events between the observed sequences. 

Another recent study reconstructed recombination events 
between lineages of Chlamydia trachomatis (Harris et al. 
2012). The study succeeded in inferring donor branches by 
combining ancestral sequence reconstruction and a test statistic 
for detecting recombination (Croucher et al. 2011) to 
identify genomic regions in which the node sequence was 
likely derived from a distant branch on the phylogenetic 
tree. However, this method reconstructs a phylogenetic tree 
after removing the effects of recombination. Such a removal 
would make phylogenetic tree construction impossible for a 
species like H. pylori, where recombination occurs throughout 
the genome. Chromosome painting and fineSTRUCTURE are 
a more appropriate choice for such a highly recombining 
organism because they do not depend on reconstruction of 
a phylogenetic tree. 

Evolution of Amerind/East Asia H. pylori 
In the analysis by chromosome painting and 
fineSTRUCTURE, Amerind and East Asian H. pylori were in- 
teresting because more subgroups were identified compared 
to the other groups of H. pylori. Additionally, these two sub- 
groups imported a significantly smaller number of chunks 
than they exported to external populations. Subgroups in 
the East Asian group (EastAsia_sg3 and EastAsia_sg4) also 
showed stronger signatures of population admixture. In the 
following paragraphs, we discuss relationships between these 
findings and previous studies on H. pylori and its host 
Homo sapiens. 

A previous study on Amerind showed a distinctive H. 
pylori population in the Peruvian Amazon that had differentiated 
from Peruvian, Spanish, and Japanese strains (Kersulyte 
et al. 2010). The H. pylori strain Shi470 included in this study 
was sampled from the Peruvian Amazon, and phylogenetic 
analysis using a single locus identified no additional subgroups 
(Kersulyte et al. 2010). Our analysis was able to distinguish as 
many as four distinct subgroups (Amerind_sg1 through 
Amerind_sg4) as well as a hybrid (PeCan4) for the Amerind 
area. 

With regard to East Asia, we have recently obtained the 
complete genome sequences of four Japanese strains and 
examined the genomic characteristics of H. pylori in East 
Asia (Kawai et al. 2011). This previous study did not detect 
any subgroups and did not examine any sign of demographic 
events or genetic flux between subgroups. This study is the 
first to report these characteristics of H. pylori in East Asia. 

Studies on human evolution in East Asia have been actively 
conducted using mitochondrial DNA, Y chromosomes, and 
genome-wide SNPs (Hammer and Horai 1995; Horai et al. 
1996; Yamaguchi-Kabata et al. 2008; Ding et al. 2011; Peng 
and Zhang 2011). The ancestral Japanese human populations 
likely originated from two major migration events from the 
Asian continent ("the dual-origin hypothesis"), resulting in 
the Jomon (proposed to be a direct ancestor of Okinawa 
people) and Yayoi populations (Hanihara 1991; Hammer 
and Horai 1995; Horai et al. 1996; Hammer et al. 2006). 
Genetic studies support the idea that modern mainland 
Japanese derived from an admixture of the Yayoi and 
Jomon people. Among the East Asian subgroups of H. pylori 
identified by the present study, EastAsia_sg1 is purely (4/4) 
from Japan (excluding Okinawa), EastAsia_sg2 is from Japan, 
Korea, and Okinawa, EastAsia_sg3 is from Japan and Korea, 
and EastAsia_sg4 is from Okinawa. Parts of EastAsia_sg3 and 
EastAsia_sg4 showed stronger signatures of population ad- 
mixture. Compared with these subgroups (EastAsia_sg3 and 
EastAsia_sg4), the signature of population admixture was 
weaker in the purely Japanese subgroup, EastAsia_sg1. 
These differences may reflect historical human population 
movements and admixture during the formation of 
modern human populations in East Asia. However, we cur- 
rently have no additional evidence to connect EastAsia_sg1 
with the Yayoi people or EastAsia_sg3 and EastAsia_sg4 
groups with the Jomon people. 

Additionally, it is interesting to consider what mechanism 
caused the asymmetric genetic flux in the Amerind and East 
Asian groups. The observation that they imported a smaller 
number of chunks than they exported to the external popu- 
lations (African, European, and Asia2 populations) could in- 
dicate a less efficient genetic mechanism to import foreign 
DNA, or a different selective environment which had sup- 
pressed such imports in the past. The former possibility could 
be tested directly by experimentation, and the latter could be 
examined by simulating histories of populations with differ- 
ent scenarios of selection and comparing patterns of genetic 
flux in co-ancestry matrices. 

Influence of Sample Structure and Concluding 
Remarks 

This study used 29 complete genome sequences of H. pylori. 
The number of available genome sequences of H. pylori has 
been limited so far, and the genomes used in this study do not 
represent a systematic sample of the species' diversity. 
Sampling bias will have a strong effect on inference of population 
structure and admixtures. 

For example, genomes of the hpSahul population (in New 
Guinea and Australia) (Moodley et al. 2009) are not available 
and thus were not included in this study. They split from 
Asian populations of H. pylori before the Asian populations 
split into hpAsia2 (Central Asia) and hpEastAsia. Most likely, 
inclusion of the hpSahul genomes would reveal an additional 
subgroup(s) with additional information about genetic flux. 
Some chunks in the Amerind and East Asia genomes would 
probably be inferred to have derived from this unexamined 
region. Such results might affect the currently observed asym- 
metric genetic flux in the Amerind and East Asian groups. It 
should be noted that inference of admixture will depend sensitively 
on bias among donor genomes in the data set. To avoid 
this problem, we need to sample a broad range of genomes in 
a way that reflects the diversity of the species. 

Similarly, the inference of hybrid ancestry of the singleton 
strains can also be affected by the sampling bias. If a closely 
related strain were found and included in the analysis, it 
would be clustered with the previous singleton strains, pos- 
sibly resulting in the disappearance of the evident signature of 
hybridization with other populations. We should keep this 
possibility in mind when interpreting the hybrid ancestry of 
singleton strains. 

Therefore, sampling bias currently limits our ability to in- 
terpret the results of population structure and admixture. 
Questions such as how many subgroups exist in a region 
and to what extent they are admixed with other subgroups 
will be difficult to answer reliably without reducing this bias. 
By including more genome sequences in future analyses, we 
could reduce the sampling bias and interpret the results with 
greater confidence. Because the algorithms we used in this 
study can be applied to hundreds of genome sequences, it will 
be interesting and desirable to analyze more genome
sequences that have been systematically sampled from
various regions.

In summary, by making use of chromosome painting and 
fineSTRUCTURE algorithms on complete genome sequences 
of a bacterial species, we were able to reveal both population
structure at a finer scale and the extent and direction of
genetic flux between subgroups of a bacterium subject to
frequent recombination events. This procedure will form a basis
for applying the novel algorithms to large-scale population 
genomic data of other species. 

Materials and Methods 

Sequencing H. pylori Strains from Okinawa 
Helicobacter pylori strains collected in Okinawa Prefectural 
Chubu Hospital, Uruma, Okinawa, Japan (OK107, OK130, 
OK139, OK144, OK155, OK160, OK180, OK181, OK185, 
OK187, OK204, and OK210 (Yamazaki et al. 2005); OK113,
OK168, OK308, and OK310 (Satomi et al. 2006); OK112 (Azuma
et al. 2004); and OK188 (this study)) were used. OK188 was
isolated and cultivated from an epithelial biopsy tissue from a
patient with duodenal ulcer. For strains carrying cagA genes, 
phylogenetic trees of cagA genes were drawn and OK107, 
OK113, OK130, OK139, OK144, OK155, OK160, OK180, 
OK181, OK185, OK187, OK204, OK210, OK308, OK310, and 
OK168 all carried cagA of the J-Western type (Truong et al. 
2009; Furuta, Yahara, et al. 2011). Standard multilocus se- 
quence typing was applied to all strains. Nucleotide sequence 
data of seven housekeeping (MLST) genes of other H. pylori 
strains were downloaded from the database pubMLST 
(http://pubmlst.org/, last accessed April 5, 2013). In the re- 
sulting neighbor-joining phylogenetic tree (supplementary fig. 
S4, Supplementary Material online), OK113, OK310, OK188,
OK130, OK139, OK204, and OK168 cluster with East Asian
strains. Among them, OK113 and OK310 were chosen for
sequencing. OK112 and OK210 are similar to each other and 
cluster with European strains 26695, G27 and P12. OK107, 
OK144, OK155, OK160, OK180, OK181, OK185, OK187, and 
OK308 are very similar to one another and cluster with a 
European strain HPAG1. 

From OK113 and OK310, genomic DNA was isolated as
described earlier (Kawai et al. 2011). In brief, cells were inoc- 
ulated onto a fresh TSA-II plate from 20% glycerol stock and 
cultured at 37 °C for 3 days under micro-aerobic conditions 
(O2, 5%; CO2, 15%; N2, 80%). Colonies were collected and
transferred into 20 ml of Brucella broth culture medium 
with 10% fetal calf serum and cultured at 37 °C for 3 days 
under micro-aerobic conditions. Genomic DNA was
extracted from culture pellets by the protease/phenol-chloroform
method and eluted in 300 µl of TE buffer (10 mM Tris-HCl,
1 mM EDTA).

Genome sequences were determined by the whole- 
genome shotgun strategy using Sanger sequencing. 
Approximately 20 µg of genomic DNA was sheared using a
HYDROSHEAR (Gene Machine). DNA fragments were frac- 
tionated by agarose gel electrophoresis and subcloned into 
the plasmid pTS1 vector (Nippon Gene) to construct shotgun 
libraries with average insert sizes of 3 and 10 kb using the
3730xl sequencer (Applied Biosystems). Template DNA was
prepared from an aliquot of bacterial culture by amplifying 
the inserted DNA of each clone using PCR. 

We produced 19,200 and 17,664 reads of the genomes of 
strains OK113 and OK310 by sequencing both ends of the clones,
giving 7.9- and 7.7-fold coverage, respectively. The sequencing 
reads were assembled with the Phred-Phrap-Consed program 
(Gordon et al. 2001) and gaps were closed by direct sequenc- 
ing of either clones that spanned the gaps or PCR products 
amplified with oligonucleotide primers designed to anneal to 
each end of the neighboring contigs. Finally, a finished se- 
quence with an error rate of less than 1 per 10,000 bases 
(QV ≥ 40) was obtained. The obtained sequences were deposited
in DDBJ with the following accession numbers: OK113,
AP012600; OK310 chromosome, AP012601; and OK310 plas- 
mid, AP012602. 
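
The error rate quoted above (less than 1 per 10,000 bases) and a quality value of 40 are linked by the standard Phred relation, Q = -10 log10(P), where P is the per-base error probability. The short Python sketch below only illustrates that conversion; the function names are ours and are not part of the Phred-Phrap-Consed package.

```python
import math


def phred_to_error(q):
    """Convert a Phred quality value to a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Convert a per-base error probability to a Phred quality value."""
    return -10 * math.log10(p)


# One error per 10,000 bases corresponds to a quality value of about 40.
print(round(error_to_phred(1 / 10_000), 6))
```

An error rate below 1 per 10,000 bases therefore corresponds to a quality value of 40 or higher.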

Genome Sequences and Alignments of Their Genes 
We used complete genome sequences of 29 H. pylori strains 
collected from various parts of the world. Names and accession
numbers are as follows: 26695, NC_000915.1; J99,
NC_000921.1; HPAG1, NC_008086.1 and NC_008087.1;
Shi470, NC_010698.2; G27, NC_011333.1 and NC_011334.1;
P12, NC_011498.1 and NC_011499.1; F57, DDBJ:AP011945;
F32, DDBJ:AP011943 and AP011944; F30, DDBJ:AP011941
and AP011942; F16, DDBJ:AP011940; B38, NC_012973.1; 51,
CP000012.1; 52, CP001680.1; v225d, CP001582.1 and
CP001583.1; B8, NC_014256.1 and NC_014257.1; SJM180,
NC_014560.1; PeCan4, NC_014555.1 and NC_014556.1;
Cuz20, CP002076.1; Sat464, CP002071.1 and CP002072.1;
OK113, DDBJ:AP012600; OK310, DDBJ:AP012601 and
AP012602; 35A, CP002096.1; 83, CP002605.1; Gambia94,
CP002332.1 and CP002333.1; India7, CP002331.1;
Lithuania75, CP002334.1 and CP002335.1; Puno120,
CP002980.1 and CP002981.1; Puno135, CP002982.1;
Santal49, CP002983.1 and CP002984.1.

An entire data set of one-to-one orthologous genes in the 
core genomic regions was prepared through ortholog cluster- 
ing by CoreAligner (Uchiyama 2008), DomClust (Uchiyama 
2006), and RECOG (http://mbgd.genome.ad.jp/RECOG/, last 
accessed April 5, 2013). Alignment of each orthologous gene 
was conducted by MAFFT (Katoh et al. 2005). SNP calling for each
orthologous gene was conducted by adegenet (Jombart
2008). We combined the SNPs while preserving information 
of SNP positions to prepare genome-wide haplotype data. 
The genome of strain 26695 was used as a reference to 
record and examine positions of SNPs. 
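
The step of combining per-gene SNPs while keeping their positions in the reference genome can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the adegenet-based procedure used in the study: each gene alignment is a dictionary from strain name to aligned sequence, genes lie on the forward strand of the reference, and columns containing gaps or ambiguous bases are skipped. The function names and the toy input are hypothetical.

```python
def call_snps(alignment, reference, gene_start):
    """Return (reference_position, {strain: base}) for each variable column.

    alignment  : dict mapping strain name -> aligned sequence ('-' for gaps)
    reference  : name of the strain whose coordinates are recorded
    gene_start : 1-based coordinate of the gene's first base in the reference
    """
    ref_seq = alignment[reference]
    snps = []
    ref_pos = gene_start - 1  # coordinate of the last reference base consumed
    for col, ref_base in enumerate(ref_seq):
        if ref_base != "-":
            ref_pos += 1
        column = {strain: seq[col].upper() for strain, seq in alignment.items()}
        bases = set(column.values())
        # Skip columns with gaps or ambiguous bases, and invariant columns.
        if not bases <= set("ACGT") or len(bases) < 2:
            continue
        snps.append((ref_pos, column))
    return snps


def combine_genes(per_gene_snps):
    """Merge per-gene SNP lists into one genome-wide list ordered by position."""
    merged = [snp for gene in per_gene_snps for snp in gene]
    return sorted(merged, key=lambda snp: snp[0])


# Toy alignment of one gene starting at reference coordinate 101.
toy = {"26695": "ATG-CA", "J99": "ATGACA", "G27": "ACG-CT"}
print([pos for pos, _ in call_snps(toy, "26695", 101)])
```

Recording each SNP by its reference coordinate, rather than by its column index in the concatenated alignment, is what allows physical distances between SNPs to be used later by the linkage model.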

Chromosome Painting In Silico and fineSTRUCTURE 
Analysis 

"Chromosome painting" was applied to the genome-wide 
haplotype data by the linkage model implemented in 
ChromoPainter (Lawson et al. 2012) (version 0.02). We followed
the instructions from the official web page
(http://www.paintmychromosomes.com, last accessed April 5,
2013). We prepared a recombination map file by specifying 
the same recombination rate per site per generation for the
SNPs based on previous estimates of recombination rate and
generation time (Webb and Blaser 2002; Morelli, Didelot, et al. 
2010). The results were visualized in the UCSC (University of
California Santa Cruz) browser (Schneider et al. 2006) after
filtering out SNPs with uncertain donor estimates and
chunks (a series of SNPs with the same expected donor) of
more than 20 kb (because of the sparse distribution of SNPs).
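
The definition of a chunk used here (a series of consecutive SNPs with the same expected donor) and the 20 kb length filter can be illustrated with a short Python sketch. It is not ChromoPainter code: we assume, for illustration only, that each SNP has already been assigned a single donor label, and that chunk length is measured from the first to the last SNP of the run. The function names and the toy donor labels are hypothetical.

```python
from itertools import groupby


def donor_chunks(positions, donors):
    """Group consecutive SNPs with the same donor into (donor, start, end) chunks."""
    chunks = []
    paired = zip(positions, donors)
    for donor, run in groupby(paired, key=lambda pair: pair[1]):
        run_positions = [pos for pos, _ in run]
        chunks.append((donor, run_positions[0], run_positions[-1]))
    return chunks


def drop_long_chunks(chunks, max_length=20_000):
    """Discard chunks whose first-to-last SNP span exceeds max_length bp."""
    return [c for c in chunks if c[2] - c[1] <= max_length]


# Toy example: five SNPs in one recipient genome and their assumed donors.
positions = [100, 250, 900, 30_000, 31_000]
donors = ["A", "A", "B", "B", "A"]
print(drop_long_chunks(donor_chunks(positions, donors)))
```

In this toy input the run assigned to donor "B" spans more than 20 kb between only two SNPs, so it is removed, whereas the two shorter runs are kept.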

For fineSTRUCTURE (version 0.02) (Lawson et al. 2012), 
both the burn-in and Markov chain Monte Carlo (MCMC) 
chain after the burn-in were run for 100,000 iterations. The 
thinning interval was set to 100. We performed the inference
twice with the same parameter values and confirmed that the
population assignments agreed.
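
As an illustration of these settings, the Python sketch below computes how many MCMC samples are retained for a given chain length and thinning interval, and checks whether two independent runs group the strains identically regardless of how the populations happen to be labeled. The dictionary input format and the function names are our own assumptions and do not reflect the fineSTRUCTURE output format.

```python
def retained_samples(iterations, thin):
    """Number of MCMC samples kept when every `thin`-th iteration is recorded."""
    return iterations // thin


def same_partition(run_a, run_b):
    """True if two assignments group the same strains together, ignoring labels."""
    def as_groups(assignment):
        groups = {}
        for strain, label in assignment.items():
            groups.setdefault(label, set()).add(strain)
        return {frozenset(members) for members in groups.values()}
    return as_groups(run_a) == as_groups(run_b)


# 100,000 post-burn-in iterations thinned by 100.
print(retained_samples(100_000, 100))
```

Comparing partitions rather than raw labels matters because the label given to a population in one run need not match the label given to the same population in another run.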

Population Assignment by STRUCTURE Using 
Multilocus Sequence Typing Data 
We also conducted population assignment by the "no admix- 
ture" model of the program STRUCTURE version 2.0 as de- 
scribed in previous studies (Linz et al. 2007; Moodley et al. 
2009). We used nucleotide sequences of 7 housekeeping 
(MLST) genes of all H. pylori strains registered in the database 
pubMLST (http://pubmlst.org/, last accessed April 5, 2013) 
plus those of strains we sequenced in earlier work (Furuta, 
Kawai, et al. 2011; Kawai et al. 2011) and in this work. A burn- 
in and MCMC chain after burn-in were conducted for 10,000 
and 20,000 iterations, respectively. We varied the parameter K 
(number of populations) from 7 (Moodley et al. 2009) through
11, and used K = 9, which maximized the likelihood. Similar to
previous works (Linz et al. 2007; Moodley et al. 2009), we
assigned each genome to a traditional group (e.g.,
hpEurope). 

Construction of a Phylogenetic Network and Tree 
A neighbor-net phylogenetic network and a neighbor-joining 
tree were also constructed from concatenated nucleotide 
sequence alignments of the genome-wide haplotype by 
SplitsTree4 (Huson and Bryant 2006) and by MEGA5 
(Tamura et al. 2012), respectively. 

Supplementary Material 

Supplementary figures S1-S4 are available at Molecular 
Biology and Evolution online (http://www.mbe.oxfordjournals.org/).